Beware of the complete separation problem
Imagine that your logistic regression model perfectly predicted the outcome: every individual
positive for the outcome had a predicted probability of 1.0, and every individual negative for the
outcome had a predicted probability of 0. This situation is called perfect separation or complete
separation, and the difficulty it causes is called the perfect predictor problem. It's a nasty and
surprisingly frequent problem, unique to logistic regression, and it highlights an ironic fact: a
logistic regression model fails to converge in the software precisely when it fits the data perfectly!
If the predictor variable (or a combination of predictor variables) in your model completely
separates the yes outcomes from the no outcomes, the maximum likelihood method tries to make the
coefficient of that variable infinite, which usually causes an error in the software. If the
coefficient is positive, the OR heads toward infinity; if it's negative, the OR heads toward 0. The
SE of the OR heads toward infinity, too, which may give your CI a lower limit of 0, an upper limit
of infinity, or both.
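To see why the coefficient runs away, here's a small sketch in Python (NumPy only, with made-up
perfectly separated data) that maximizes the log-likelihood by plain gradient ascent; the slope
coefficient b just keeps growing, and so does the OR:

    import numpy as np

    # Made-up, perfectly separated data: x < 0 always gives y = 0, x > 0 always gives y = 1
    x = np.array([-3.0, -2.0, -1.0, 1.0, 2.0, 3.0])
    y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

    a, b = 0.0, 0.0  # intercept and slope of the logistic model
    for step in range(20001):
        p = 1.0 / (1.0 + np.exp(-(a + b * x)))  # predicted probabilities
        a += 0.1 * np.sum(y - p)                # gradient ascent on the intercept
        b += 0.1 * np.sum((y - p) * x)          # gradient ascent on the slope
        if step % 5000 == 0:
            print(f"step {step:5d}: b = {b:7.3f}, OR = {np.exp(b):10.1f}")

    # b never settles down -- the likelihood has no finite maximum, so the
    # "best" slope (and with it the OR) drifts off toward infinity.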
Check out Figure 18-8, which illustrates the problem. The regression tries to make the curve come
as close as possible to all the data points. Usually it has to strike a compromise, because there's
a mixture of 1s and 0s, especially in the middle of the data. But with perfectly separated data, no
compromise is necessary. As b becomes infinitely large, the logistic function morphs into a step
function that touches all the data points (notice the curve where b = 5).
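You can watch this morphing happen numerically with a few lines of Python (continuing the NumPy
sketch above):

    import numpy as np

    x = np.linspace(-2, 2, 9)
    for b in (1, 5, 25):
        p = 1.0 / (1.0 + np.exp(-b * x))  # logistic curve with slope b
        print(f"b = {b:2d}:", np.round(p, 3))

    # As b grows, the predicted probabilities snap from ~0 to ~1 right at
    # x = 0 -- the smooth S-shaped curve turns into a step function.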
While it's relatively easy to identify a single perfect predictor in your data set by looking at
frequencies (as shown in the sketch below), you may also run into the perfect predictor problem as
a result of a combination of predictors in your model, which is much harder to spot.
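For a single candidate predictor, a simple cross-tabulation makes the problem obvious. Here's a
quick sketch in Python using pandas (the variable names are hypothetical):

    import pandas as pd

    # Hypothetical data with a binary outcome and a binary candidate predictor
    df = pd.DataFrame({
        "exposure": [0, 0, 0, 0, 1, 1, 1, 1],
        "outcome":  [0, 0, 0, 0, 1, 1, 1, 1],
    })

    # A zero cell in every row means the predictor completely separates the
    # outcomes -- a red flag before you even run the regression
    print(pd.crosstab(df["exposure"], df["outcome"]))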
Unfortunately, there aren't any great solutions to this problem. One proposed solution, called the
Firth correction, effectively adds a small amount of information (roughly equivalent to half an
observation) to the data set, which disrupts the complete separation. If you can apply this
correction in your software, it will produce output, but the results will likely be unstable, with
estimates very near 0 or very near infinity. Trying to fix the model by changing the predictors
wouldn't make sense, because the model already fits perfectly. You may be forced to abandon your
logistic regression plans and instead provide a descriptive analysis.
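If you're curious what the Firth correction looks like under the hood, here's a minimal,
illustrative Newton-Raphson sketch of Firth's penalized likelihood in Python (NumPy only, reusing
the toy separated data from above); for a real analysis, use an established, vetted implementation
instead:

    import numpy as np

    def firth_logistic(X, y, n_iter=200, tol=1e-8):
        # Minimal Firth-penalized logistic regression; X must include an
        # intercept column. Real implementations also add step-halving.
        beta = np.zeros(X.shape[1])
        for _ in range(n_iter):
            p = 1.0 / (1.0 + np.exp(-(X @ beta)))
            w = p * (1.0 - p)                 # observation weights
            XtWX = X.T @ (w[:, None] * X)     # Fisher information matrix
            XtWX_inv = np.linalg.inv(XtWX)
            Xw = X * np.sqrt(w)[:, None]
            h = np.einsum("ij,jk,ik->i", Xw, XtWX_inv, Xw)  # leverages
            # Firth's modification shifts each residual by h * (1/2 - p)
            score = X.T @ (y - p + h * (0.5 - p))
            step = XtWX_inv @ score
            beta = beta + step
            if np.max(np.abs(step)) < tol:
                break
        return beta

    # Perfectly separated toy data, as before
    x = np.array([-3.0, -2.0, -1.0, 1.0, 2.0, 3.0])
    y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
    X = np.column_stack([np.ones_like(x), x])  # intercept + predictor

    beta = firth_logistic(X, y)
    print("b =", beta[1], " OR =", np.exp(beta[1]))

Unlike ordinary maximum likelihood, this iteration settles on finite coefficients even for
perfectly separated data, though (as noted above) the resulting OR can still sit uncomfortably
close to 0 or infinity.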